Classification With Model Interpretation 💯 💯

Importing Modules

Reading Data

Shape

Distribution of Target Variable

Class imbalance is visible here. Also, there are 8 categories; let's combine them into 3.

Response 8 has the highest count and Response 3 the least.

Processing Target Variable
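The original response has 8 ordinal classes that get bucketed into 3. The exact bucket boundaries used in the notebook are not shown, so the grouping below is an assumption purely for illustration:

```python
import pandas as pd

# Hypothetical grouping: the actual bucket boundaries used in the
# notebook are an assumption here (1-4 -> 0, 5-6 -> 1, 7-8 -> 2).
def combine_response(r):
    if r <= 4:
        return 0  # low responses
    elif r <= 6:
        return 1  # mid responses
    return 2      # high responses

df = pd.DataFrame({"Response": [1, 3, 5, 6, 8, 8, 2]})
df["Response_3"] = df["Response"].map(combine_response)
```

Mapping to a new column first (and dropping `Response` afterwards) keeps the original labels available while checking the new distribution.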

Some imbalance is still visible.

Removing old target variable

Making lists of categorical and numerical columns
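Splitting columns by dtype can be done with `select_dtypes`; a minimal sketch on a toy frame (the column names are illustrative, matching the dataset's naming style):

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Info_2": ["D3", "A1", "D3"],  # string -> categorical
    "BMI": [0.32, 0.45, 0.29],             # float -> numerical
    "Wt": [0.15, 0.30, 0.12],
})

# object/string columns vs. numeric columns
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
```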

Visualizations On Categorical Features

D3 has the highest frequency.

Most of the features here are unbalanced.

Right-skewed.

Outliers can be seen.

Checking For Feature Correlations Greater Than 0.8
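One way to list all feature pairs with absolute correlation above 0.8 is to scan the upper triangle of the correlation matrix. A sketch on synthetic data (the near-proportional `Wt`/`BMI` pair is simulated to mimic the finding below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
wt = rng.random(100)
df = pd.DataFrame({
    "Wt": wt,
    "BMI": wt * 1.1 + rng.normal(0, 0.01, 100),  # nearly proportional to Wt
    "Ins_Age": rng.random(100),                  # independent
})

corr = df.corr().abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(a, b) for a in upper.index for b in upper.columns
              if upper.loc[a, b] > 0.8]
```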

CONCLUSION

BMI and Weight are highly correlated, which makes sense as the two features are directly proportional.

Ins_Age is highly correlated with Family_Hist_2 and Family_Hist_4.

However, I am not going to transform or drop any of these features: the models are tree-based, and their non-parametric nature means correlated features have little effect on their predictions.

Null Value Check

Removing unimportant column

X and Y split

Filling Remaining Missing Values
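The notebook's exact fill strategy isn't stated; a common choice for numeric features is median imputation, sketched here on a toy frame:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "BMI": [0.3, np.nan, 0.5, 0.4],
    "Wt": [0.2, 0.25, np.nan, 0.3],
})

# Median imputation per column; the notebook's actual strategy is assumed.
X_filled = X.fillna(X.median())
```

The median is robust to the outliers noted earlier, which is why it is often preferred over the mean here.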

Train Test Split
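Since the combined target is still imbalanced, a stratified split keeps class proportions similar in both halves. A minimal sketch (sizes and seed are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 6 + [2] * 4)  # imbalanced 3-class target

# stratify=y preserves the class ratios in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```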

Shapes of Train and Test Data

Some Important functions that I will be using throughout
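The final results compare train/test errors, ROC scores and F-scores across models, so a shared evaluation helper is useful. The notebook's actual helpers aren't shown; this is one plausible sketch (the `evaluate` function and the demo data are my own):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Train/test accuracy, macro F1 and one-vs-rest ROC-AUC in one dict."""
    return {
        "train_acc": accuracy_score(y_tr, model.predict(X_tr)),
        "test_acc": accuracy_score(y_te, model.predict(X_te)),
        "f1": f1_score(y_te, model.predict(X_te), average="macro"),
        "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te),
                                 multi_class="ovr"),
    }

# quick demo on a synthetic 3-class problem
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
metrics = evaluate(LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
                   X_tr, y_tr, X_te, y_te)
```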

Random Forest
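Fitting the forest and pulling its impurity-based importances is the basis for the plot below; a self-contained sketch on synthetic data (hyperparameters are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# impurity-based importances; they sum to 1 across features
importances = rf.feature_importances_
```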

Feature Importance For Random Forest

Plotting only those features which are contributing something

CONCLUSION:

BMI, Wt, Medical_History_23, Medical_History_4 and Medical_Keyword_15 seem to be the important features according to the random forest.

Also, only these features contribute to the model's predictions. On further investigation, the non-contributing features could be eliminated.

Model Interpretability For Random Forest

Using Lime

Using Shap

Findings

Medical_Keyword_15, Medical_History_9, Wt and Medical_History_3 are all pushing the prediction towards class 1.

The orange ones are pushing towards class 1.

Dependence Plots

Findings

With high Medical_History_23 and low BMI we get class 1.

Gradient Boosting
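The same fit-then-inspect pattern applies to gradient boosting; a minimal sketch with scikit-learn's `GradientBoostingClassifier` on synthetic data (default hyperparameters, which may differ from the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
gb = GradientBoostingClassifier(random_state=0)
gb.fit(X, y)

# same importance interface as the random forest
gb_importances = gb.feature_importances_
```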

Feature Importance For Gradient Boosting

CONCLUSION:

BMI, Wt, Medical_History_23, Medical_History_4 and Medical_Keyword_15 seem to be the 5 most important features according to Gradient Boosting.

Model Interpretability For Gradient Boosting

Using Lime

Using Shap

Findings

BMI is pushing the model's prediction towards class 0.

Medical_Keyword_15 is pushing towards class 1, whereas Medical_Keyword_4 is pushing towards class 0.

Also, although Wt was among the top 5 features in the feature-importance plot, the same does not hold here.

Dependence Plots

Findings

For low BMI and high Medical_History_23 we get class 1.

XGBOOST

Feature Importance For XGBoost

Conclusion:

The same trend is seen here.

All three models also give similar scores, which suggests the same features are contributing the most and hence the similar scores.

Model Interpretability for XGBoost

Using Shap

Again, BMI is pushing towards class 0.

Medical_History_4 is pushing towards class 1.

Dependence Plots

Product_Info_4 and Wt show an interesting trend.

Logistic Regression

Feature Importance For Logistic Regression
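Unlike the tree models, logistic regression has no `feature_importances_`; a common stand-in is the magnitude of the (standardized) coefficients. A sketch under that assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)
# scale first so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
lr = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# absolute coefficient size as a rough importance score
importance = np.abs(lr.coef_[0])
```

The sign of each coefficient also tells which class the feature pushes towards, which connects directly to the LIME findings below.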

Conclusion

And again, the same pattern appears in the feature importances.

Model Interpretability for logistic regression

Using Lime

Findings

Only BMI and Medical_History_4 are pushing towards class 0.

Max Voting Model
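Max voting combines the fitted models by majority vote, which scikit-learn exposes as `VotingClassifier` with `voting="hard"`. A sketch with illustrative base models and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # majority (max) voting over predicted labels
)
voter.fit(X, y)
```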

Stacked Model
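Stacking instead trains a meta-model on the base models' cross-validated predictions; `StackingClassifier` handles the plumbing. The base/meta choices here are an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50,
                                              random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
    cv=3,  # out-of-fold predictions feed the meta-model, limiting leakage
)
stack.fit(X, y)
```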

Models And Their Accuracies

Final Results

Gradient Boosting, the Voting Classifier and the Stacked model all perform really well: their train and test errors are close, and their ROC and F-scores are high and consistent.